ABSTRACT

This article discusses practical considerations in 
developing tests of listening comprehension in second language 
learning with a particular emphasis on the choice of listening
passages and assessment tasks. The listening construct is defined as 
the process of receiving, attending to, and assigning meaning to 
aural stimuli. Questions should be developed while listening to, not 
reading, the intended passage. Selection of passage may depend upon 
orality versus literacy, audio versus video, passage length, content 
familiarity, vocabulary and sentence structure, elaborations and 
redundancies, speech rate and pauses, and fuzzy word boundaries or 
other speech phenomena. Consideration must be given to type of
response expected, including multiple choice, true-false, open-ended
recall, and nonverbal. Presentation effects should also be part of
the decision, such as advance organizers, language of instructions 
and questions, and uniformity of presentation. New tests should be 
refined by pilot testing before actual use. (Contains 41 references.)

TESTING LISTENING COMPREHENSION

by Irene Thompson 




INTRODUCTION 

The central role of listening comprehension in second lan- 
guage (L2) acquisition is now largely accepted, and most 
modern materials and methodologies are placing an in- 
creasing emphasis on activities designed to promote the 
development of this important skill (Rubin, 1994). Listen- 
ing comprehension testing, on the other hand, continues to 
remain somewhat of a neglected area. To begin the discussion
of testing L2 listening comprehension, we first need to
define the construct. For purposes of this discussion, I will 
adopt a very general definition proposed by Wolvin and
Coakley (1985:74) that listening comprehension is "the pro- 
cess of receiving, attending to, and assigning meaning to 
aural stimuli." I will then discuss some practical consider- 
ations in developing tests of listening comprehension with 
particular emphasis on the choice of listening passages and
assessment tasks.

SPECIAL QUALITIES OF THE AURAL MEDIUM 

When developing tests of listening comprehension you 
should consider the special qualities of the aural medium. 
To begin with, listeners, unlike readers, cannot review and 
reevaluate information presented to them. They must com- 
prehend the text as they listen to it, retain information in 
memory, integrate it with what follows, and continually 
adjust their understanding of what they hear in the light of 
prior knowledge and of incoming information. This heavy 
processing load makes listening comprehension different 
from reading comprehension in a number of significant 
ways. 

First of all, people recall less information from listening 
than from reading in terms of both quantity and quality. 
Although the probability of recalling idea units after both 
listening and reading is influenced by their position in the 
hierarchical structure of the text, this effect is more pronounced
in the case of listening (Hildyard and Olson, 1982;
Lund, 1991a; Meyer and McConkie, 1973). Facts that are
incidental or irrelevant to the main ideas of the text have a 
low probability of recall in listening (Shohamy and Inbar, 
1991). 

This has practical implications for testing listening comprehension.
You should put yourself in the position of the
examinees and develop the questions as you listen to the 
passage, not as you read the transcript. This will lessen the 
likelihood of including questions that are better suited for 
testing reading than testing listening. 

SELECTING LISTENING PASSAGES

Among aural passages are conversations, instructions, announcements,
stories, lectures, news reports, movies, plays,
interviews, debates, speeches, and advertisements,
to mention just a few. Each of these texts has its own
special features which affect ways in which it will be pro- 
cessed and understood. 

There are many considerations in selecting suitable 
passages for testing listening comprehension. The most 
obvious ones are level of difficulty, interest, and relevance. 
Finding an authentic passage at the desired level of difficulty
is not easy because so many factors need to be considered.
Bear in mind that it is often impossible to predict the
empirical difficulty of listening items on the basis of pas- 
sages alone, because difficulty resides not just in the text, 
but in the interaction of text variables with tasks, back- 
ground knowledge, memory, and inferencing ability. As a 
result, the same passage can yield items with different 
degrees of difficulty. Some of the features to keep in mind 
when selecting listening passages for testing are discussed 
below. 

Orality vs. literacy 

Oral texts can be arranged along a continuum with those 
closer to the spoken language, at one end, and those closer 
to the written language, at the other (Tannen, 1982, 1985). 
Idea units in the spoken language are typically expressed in 
short clauses, are loosely strung together, contain repeti- 
tions, and are bounded by pauses because speakers don't 
always have time to plan their utterances. Idea units in the 
written language, on the other hand, tend to be longer, more
complex, and contain densely packed information because
writers have time for planning, editing, and revising (Chafe,
1985). It has been demonstrated that texts closer to the oral 
end of the continuum yield higher scores on listening
comprehension tests than passages closer to the written 
end. Shohamy and Inbar (1991) showed that with the topic 
held constant, news broadcasts (pre-written edited mono- 
logues) were more difficult to understand than lectures 
(monologues delivered from written notes). Thompson 
(1993) reported that conversations yielded higher compre- 
hension scores than expository passages on the listening 
portion of the ETS Comprehensive Russian Proficiency Test
(1990). On the other hand, Berne (1992) found no significant
difference between scores on a long lecture and an interview
on the same topic, both created from the same written
article. Other research shows that texts are easier to understand
if they contain such conversational features as repeated
nouns (Chaudron, 1983), and advance organizers
that call attention to major propositions, transitions, and 
emphases in the text (Chaudron and Richards, 1986). Other 
spoken features, such as redundancies and elaborations, 
are helpful only after learners have reached a certain level 
of proficiency (Chiang and Dunkel, 1992; Derwing, 1989).
If you are planning to use authentic passages for lower-ability
examinees, you should look for texts that are closer
to the spoken than to the written language. In general, you 
should avoid using written materials for testing listening 
comprehension since it is quite difficult to modify them to 
make them resemble spoken language. Rather than collect- 
ing written sources, you should keep a library of recorded 
passages from radio, TV, movies, or other sources. 

Audio or video? 

If you decide to base your listening comprehension test on 
a video segment, you should consider the extent to which 
visual clues interact with the oral message (Joiner, 1990; 
Phillips, 1990). Keep in mind that visual support is particularly
helpful for lower-proficiency listeners (Mueller, 1980).
Videos vary in the extent to which they provide visual 
support that is helpful to viewers. At the one extreme are 
segments in which visuals obviate the need for listening, 
while at the other extreme are segments in which the visuals
bear no relationship to the sound track. The extent of visual 
support varies according to genre, with dramatic segments, 
such as movies, soap operas, and TV series, providing more 
visual, action and interaction cues than interviews, speeches, 
and news, which tend to be dominated by "talking heads." 
Weather, sports, and various news reports vary in the 
amount of visual support from segment to segment, and 
country to country. High-tech American and European TV 
programs, which abound in location shots, are generally 
richer in visual cues than programs from Russia, the former 
republics, and Eastern Europe. 

Length of passages 

Heavy processing requirements imposed by the oral me- 
dium cause listeners to lose concentration rather quickly. 
Listeners report "tuning out" if passages are more than two to
three minutes long (Thompson and Rubin, forthcoming).
During the field testing of the ETS Advanced Russian Listening/Reading
Test (1986) which contained a 50-minute
listening and a 50-minute reading portion, students not
only did more poorly in listening than in reading, but they 
also reported greater difficulty maintaining their concen- 
tration during 50 minutes of listening than during an 
equivalent period of reading (unpublished data). 

Experience shows that listeners can attend to some 
types of oral passages longer than to others. For instance, 
dramatic TV segments, which consist of conversations accompanied
by action, hold listeners' attention longer than
TV news reports, speeches, or lectures. As a rule of thumb, 
oral passages for testing should not be longer than two or 
three minutes. 

Content familiarity 

The content of a listening passage will affect all test takers 
by making it easier to understand for those who are familiar 
with the topic, and more difficult for those who are not
(Chiang and Dunkel, 1992; Long, 1990; Markham and
Latham, 1987; Schmidt-Rinehart, 1992). This is especially
true if test questions require students to go beyond the 
passage, and to make inferences based on prior knowledge 
about the subject (Buck, 1991). To minimize the effect of 
prior knowledge on listening test performance, you should 
either select passages that are neutral with respect to potential
differences in familiarity with the topic, or include an
extensive sampling of topics. 

Vocabulary

There is little doubt that vocabulary recognition plays an 
extremely important role in listening comprehension. Pas- 
sages which contain frequently used words are easier to 
understand than passages which contain many specialized 
and technical words, idioms, and cultural allusions. Being 
able to recognize a familiar word which has little to do with 
the main idea of the passage can cause lower-level listeners
to "go off on a tangent," as illustrated in the following 
example. First-year students of Russian listened to a con- 
versation between two Muscovites making plans to attend 
a friend's birthday party. Among other details, they agreed 
to meet at the "Tretyakovsky" metro station. When asked
"What is this conversation about?", some students an- 
swered that it was about going to a museum, because they 
recognized the word Tretyakovsky, the name of a famous 
art gallery. 

When selecting listening passages for lower-proficiency 
test-takers, you should make sure that some of the key 
vocabulary is recognizable, or inferable from context. Keep 
in mind, however, that familiar words and cognates are not 
always easily retrievable from dynamic speech, and that 
even fairly advanced learners may fail to understand famil- 
iar words if the latter are used in a different meaning or in 
an unfamiliar context, and may experience difficulties with 
numbers and proper names (Laviosa, 1991). 

Sentence structure 

A question test constructors often ask is "Should I simplify 
sentence structures to make the passage easier to compre- 
hend?" It seems intuitively appealing to think that syntax 
should play a major role in listening comprehension, but 
there is not enough research to answer the question as to 
whether, everything else being equal, syntactically complex
sentences are harder to understand than simple ones. Blau
(1990) found no significant effect of sentence structure
simplification on listening comprehension of advanced ESL 
students, while Glisan (1985) found that longer, modified 
sentences were actually better understood than shorter, 
unmodified ones by advanced students of Spanish. Unfortunately,
there are no studies that deal with the effects of
syntactic complexity on the listening comprehension of 
lower-ability L2 listeners. 

There is some evidence, however, that word order may
affect the comprehension of speech. For instance, advanced
English-speaking students of Spanish comprehended Spanish
Subject-Verb-Object (SVO) sentences better than VSO
and OVS sentences (Glisan, 1985). The latter type was
particularly difficult, leading one to hypothesize that pas- 
sages in which there are many OVS sentences (such as is 
often the case in Russian), might be difficult to process for 
speakers of English where this pattern is extremely uncom- 
mon. 

Elaborations and redundancies 

Redundancy in the form of repeated nouns ("The pencil...
the pencil is on the table") appears to be more effective 
than other reinstatement devices, such as synonyms or 
simple topic reiterations ("This is a pencil. The pencil is on 
the table") for listeners at lower and intermediate levels of 
proficiency (Chaudron, 1983). Increased redundancy of 
information (repetition) and elaboration (paraphrase, use 
of synonyms) may not be beneficial for lower-ability listen- 
ers because lack of adequate vocabulary prevents them 
from taking advantage of redundant information (Chiang 
and Dunkel, 1992). The practical implication is that an 
authentic passage can be made more comprehensible for 
lower-proficiency learners through added repetition of
nouns, while for more advanced listeners paraphrase and 
modifiers may be more effective. 

Insertion of various macro discourse markers referring 
to major propositions in a monologue may also improve its
comprehensibility. Examples of macro discourse markers 
are "What I'm going to talk about today is...," or "Let's go
back to the beginning." On the other hand, micro discourse
markers, such as temporal links (after that) and causal
connectors (therefore, consequently) signaling
intersentential connections may have no facilitating effect 
(Chaudron and Richards, 1986; Hron et al., 1985). The 
practical implication is that a passage can be made more 
accessible if macro markers are inserted at major
discourse boundaries. 

Speech rate 

There is some rather unsurprising evidence that excessive 
speed (faster than 200 wpm) impairs comprehension of 
lower-intermediate ESL learners (Griffith, 1990). These learners
seem to perform best at a slower rate of around 120 wpm
(Griffith, 1992; Kelch, 1985). On the other hand, more ad- 
vanced listeners appear to be affected not so much by rate 
of speech as by other factors, such as text type, task, and 
prior knowledge (Blau, 1990; King and Behnke, 1989). Keep
in mind that research evidence is limited and conflicting 
because studies use different subjects, languages, texts, 
tasks, definitions of "normal" rate for different languages, 
and measurement techniques. However, it seems reasonable
to assume that passages delivered at high speech rates
are probably not suitable for examinees at lower levels of
proficiency. 
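
As a rough illustration of how a test developer might screen a recorded passage, the sketch below computes speech rate in words per minute from a word count and a duration. The function names and the labels are hypothetical, and the 120 wpm and 200 wpm defaults are simply the figures cited above for lower-intermediate ESL learners; they are assumptions that may not carry over to other languages or populations.

```python
def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Return the average speech rate of a passage in words per minute."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count * 60.0 / duration_seconds


def rate_label(wpm: float, slow: float = 120.0, fast: float = 200.0) -> str:
    """Classify a rate against illustrative thresholds (assumed defaults)."""
    if wpm > fast:
        return "fast"
    if wpm <= slow:
        return "slow"
    return "moderate"


# Hypothetical passage: 300 words spoken in 2 minutes 30 seconds.
rate = words_per_minute(300, 150)
print(rate, rate_label(rate))
```

A count of this kind says nothing about pauses or about the phonological reductions that accompany faster speech, so it can only complement listening to the recording itself.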



Pauses 

Since spoken language tends to be relatively seamless and 
continuous, pauses act much like punctuation marks do in 
writing to break up the spoken signal into constituents. 
Therefore, one would assume that pauses should help 
listeners process the message more easily. However, stud- 
ies indicate that there appears to be a threshold of language 
proficiency below which pauses do not aid listening com- 
prehension. For instance, pause insertion did not increase 
the comprehension of lower-ability students (Jacobs et al., 
1988), but inserting longer than normal pauses at clause or
sentence boundaries helped advanced listeners to comprehend
expository passages more than slowing down the
speech rate (Blau, 1990, 1991).

Fuzzy word boundaries and other dynamic
speech phenomena 

Words in dynamic speech undergo various transforma- 
tions through assimilation, vowel reduction, consonant 
weakening, liaison, and syllable contraction, so that even 
native listeners have occasional difficulty in reconstructing
citation forms from a stream of speech (Hieke, 1987). In 
addition, units in dynamic speech, i.e., uninterrupted 
stretches of speech between pauses, are much longer than 
citation forms, i.e., units corresponding to single words. 
According to Carterette and Jones (1974:367), dynamic forms
contain an average of twelve phonemes, as compared to 
citation forms that contain an average of just three. L2
listeners, whose initial exposure is often to L2 words spoken
in isolation, fail to recognize even highly familiar words in
running speech because their limited knowledge of the 
language does not allow them to compensate for missing 
phonological information due to assimilation, contraction, 
liaison, and elision (Henrichsen, 1990). In Russian, words 
can change both in terms of the number of syllables and in 
vowel and consonant quality. Thus, [stol] can be buried in 
[nastAl'e]. This is one more reason why one should not 
depend on written transcripts when selecting listening 
passages. One should listen, instead, to the spoken version 
to decide whether the passage contains too many phono- 
logical transformations to be suitable for lower-proficiency 
learners. You may need to re-record a passage in which key 
vocabulary items have undergone such significant sandhi transformations
as to be inaccessible to lower-level listeners.

DESIGNING ASSESSMENT TASKS 

If you want to interpret scores on tests of listening comprehension
as indicators of listening ability, you must make
sure that these scores measure listening ability and not 
much else. This means that you should minimize potential 
sources of measurement error, i.e., factors other than listening
comprehension. Various sources of measurement error
in testing listening comprehension are discussed below. 
Memory

Memory is an inseparable part of comprehension. However,
its role in listening may be different from its role in
reading. In reading, the examinee can refer back to portions
of the text that contain information necessary for answering
a question. In listening, however, the examinee cannot re- 
access the text when attempting to construct an answer. 
This means that you should consider the extent to which a 
question may overburden the examinee's ability to remember
textual information (Thompson, 1993). A listener may
have comprehended what was being said at the time of 
listening, but by the time he or she got to the question(s), the
memory trace may have been erased by subsequent infor- 
mation in the text, and by having to read the question and 
answer options. In real life, note-taking is of considerable 
help to listeners, but under the time constraints of a testing 
situation, careful note-taking may not always be possible. 
An example from an experimental Russian listening comprehension
test taken by 100 students (unpublished data)
shows why two questions based on the same passage have 
different difficulty levels due to differential memory load. 
After having listened to a weather report, students were 
asked two multiple-choice questions which are reproduced
below: 

1. The forecast calls for

(A) sunshine
(B) light snow
(C) partial overcast
(D) thick fog

2. The current temperature in Moscow is

(A) 6 degrees
(B) 10 degrees
(C) 13 degrees
(D) 19 degrees



Ninety-six percent of the examinees answered the first
question correctly, in contrast to the second question which
was answered correctly by seventy-eight percent of the test-takers.
Why was the second question more difficult than the
first one? One possibility is that the answer to the first 
question depended largely on being able to recognize a 
specific vocabulary item, while the response to the second 
question required the examinees to recall which number 
corresponded to the current temperature, as opposed to 
barometric pressure, wind velocity, and nighttime temperature,
all of which were also mentioned in the forecast.
This means that you should make an effort to design items 
that do not require listeners to recall incidental details (Aly, 
1993).
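
The percentages quoted for the two weather-report questions are what test developers usually call item facility: the proportion of examinees who answer an item correctly. A minimal sketch of that calculation follows. The response lists are made up to reproduce the 96 and 78 percent figures for 100 students, and the keyed options ("C" and "B") are placeholders, since the source does not say which options were correct.

```python
from typing import List


def item_facility(responses: List[str], key: str) -> float:
    """Proportion of examinees who chose the keyed (correct) option."""
    if not responses:
        raise ValueError("responses must not be empty")
    correct = sum(1 for choice in responses if choice == key)
    return correct / len(responses)


# Made-up data for 100 examinees; the keys are hypothetical placeholders.
question_1 = ["C"] * 96 + ["A"] * 4
question_2 = ["B"] * 78 + ["C"] * 22

print(item_facility(question_1, "C"))
print(item_facility(question_2, "B"))
```

Comparing facility values for items based on the same passage is one simple way to spot questions whose difficulty comes from memory load and not from the passage itself.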

Inferencing and other mental operations

Test questions measure not only comprehension but also
the ability to draw inferences, solve problems, and make 
deductions from text content. An example from a Russian 
test shows how cognitive demands can affect item difficulty.
After listening to a monologue about Pasternak's
novel Doctor Zhivago, examinees were asked three questions
which test developers predicted to be roughly equivalent
in difficulty. The results of the field test proved them
wrong. Two of the questions which dealt with information 
that was explicitly stated in the monologue were answered 
correctly by about half of the test takers. However, only ten 
percent of them were able to answer the third question 
which required them to make an inference. This suggests 
that test developers should keep in mind that the more
complex the mental operations involved in arriving at the 
correct answer, the more difficult the listening item is likely 
to be. 

TYPE OF EXPECTED RESPONSE 

Listeners' performance will be affected by the type of re- 
sponse that is required of them. Among the most commonly 
used responses are selected responses and constructed 
responses. Selected responses do not require test-takers to 
create a response, merely to select the most plausible
option. Constructed responses require test-takers to pro- 
duce their own answers. Berne (1992) found that students of 
Spanish received significantly higher scores on a multiple-choice
version than on either open-ended or cloze versions
of the same test, but found no difference between open-ended
and cloze versions. In a validation study of the
ACTFL Russian proficiency guidelines, the mean score for 
multiple-choice questions was higher than that for open- 
ended items. 

The advantages and disadvantages associated with 
different types of responses are discussed below. 

Multiple-choice questions 

Multiple-choice questions have several advantages. In the 
first place, they are easy and fast to score because no 
judgment is required on the part of the scorers. Secondly, 
multiple-choice items require a minimal amount of time to 
complete; therefore, multiple-choice tests can include many
items, which enhances test reliability. Thirdly, multiple- 
choice items minimize the confounding of listening with 
speaking or writing because they have no production re- 
quirements, even though reading remains a confounding 
factor. All these features make multiple-choice tests practi- 
cal in situations that require testing of large numbers of 
individuals. However, there are a number of disadvantages 
as well. First, multiple-choice items invite guessing. Sec- 
ondly, important parts of a passage sometimes cannot be 
tested simply because three plausible distractors cannot be 
found. Last, but not least, good multiple-choice questions 
are extremely difficult to write. Common problems include 
clues pointing to the right answer, confusing or implausible
distractors, an insufficient number of distractors (ideally, there
should be one correct answer and three distractors), unclear
or lengthy wording, negative wording, and more than
one correct option.
True-false questions

True-false items are easier to write than multiple-choice 
questions, but the examinee has a fifty-percent chance of 
being correct by guessing. Because both multiple-choice 
and true-false responses encourage guessing, it is common 
practice for test instructions to state whether or not there is 
a penalty for guessing, and what that penalty is. 
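
The text does not say which penalty to apply. One conventional choice, shown here only as an assumption, is the correction-for-guessing formula, which subtracts from the number of right answers the number of wrong answers divided by one less than the number of options per item. Omitted items are neither rewarded nor penalized. The function name is hypothetical.

```python
def corrected_score(right: int, wrong: int, options_per_item: int) -> float:
    """Apply the conventional correction for guessing: R - W / (k - 1)."""
    if options_per_item < 2:
        raise ValueError("an item needs at least two options")
    if right < 0 or wrong < 0:
        raise ValueError("counts must be non-negative")
    return right - wrong / (options_per_item - 1)


# True-false items have two options, so each wrong answer costs a full point.
print(corrected_score(right=30, wrong=10, options_per_item=2))

# Four-option multiple choice: each wrong answer costs one third of a point.
print(corrected_score(right=30, wrong=9, options_per_item=4))
```

Under this convention a test-taker who guesses blindly on every item would expect a corrected score near zero, which is the rationale usually given for announcing the penalty in the instructions.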

Open-ended questions 

Open-ended questions avoid some of the problems associated
with multiple-choice items. In the first place, they
invite guessing less than multiple-choice items. In the sec- 
ond place, they allow test constructors to ask any question, 
not just a question for which four plausible multiple-choice 
options can be designed. However, open-ended questions 
do not always work as intended because more than one 
answer can sometimes be reasonably interpreted as correct.
This often happens when the answer depends on extratextual
information, a situation which frequently arises in connection
with higher-level questions. Since test-takers differ
in terms of background knowledge, it is sometimes difficult 
to predict what their answers might be. Here is an example 
of a poorly designed open-ended question. After listening 
to an interview with a literary critic, test-takers were asked:
"What is Solzhenitsyn's role as a writer?" Some test-takers 
based their answers on prior knowledge about Solzhenitsyn
and not on what was actually stated in the interview. As a
result, it was difficult to decide whether some answers were
acceptable or not. To solve this problem, the question was 
re-worded to read: "What arguments did the interviewee 
use to support her opinion about Solzhenitsyn's writing?"
This formulation indicated to the test-takers that their an- 
swer had to be based on information contained in the 
interview. As a result, the range of responses was narrowed,
and scoring was made easier. 

Another problem with open-ended questions arises 
when there is insufficient indication of just how much 
information should be included in the answer (Buck, 1991). 
Here is an example. Students listened to a monologue in
which the speaker outlined a program for economic reforms
in Russia. They were asked "How does the speaker
propose to change Russia's economy?" Answers ranged 
from skeletal ("He advocates capitalism") to relatively de- 
tailed ("He suggests that state enterprises be converted to 
private ownership; he also wants the government to attract 
foreign investments and to control inflation"). Binary 
(right/wrong) scoring would have been inappropriate in
this case because both answers are correct. One solution is
to develop a scale which awards points based on the number
of correct details in the answer. This solution requires test
developers to prepare a list of all propositions in the passage.
The other solution is to re-word the question: "List at least
two economic measures advocated by the speaker." This
wording tells examinees how much information is expected 
in their response. 
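
The first solution, a scale based on the number of correct details, can be sketched as follows. The proposition list is a hypothetical reconstruction from the two sample answers quoted above, not the actual list used for the test, and the judgment of which propositions an answer contains is assumed to be made by a human scorer; the code only tallies those judgments.

```python
from typing import Optional, Set

# Hypothetical proposition list for the economic-reform monologue.
PROPOSITIONS = {
    "gist": "the speaker favors a market (capitalist) economy",
    "privatize": "state enterprises should become privately owned",
    "investment": "the government should attract foreign investment",
    "inflation": "the government should control inflation",
}


def score_answer(credited: Set[str], max_points: Optional[int] = None) -> int:
    """Award one point per listed proposition a human scorer credited."""
    unknown = credited - set(PROPOSITIONS)
    if unknown:
        raise ValueError(f"not in the proposition list: {sorted(unknown)}")
    points = len(credited)
    return points if max_points is None else min(points, max_points)


skeletal = score_answer({"gist"})
detailed = score_answer({"privatize", "investment", "inflation"})
print(skeletal, detailed)
```

The optional cap mirrors the re-worded question: if examinees are asked for at least two measures, setting `max_points=2` keeps longer answers from earning extra credit.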

Yet another problem in scoring open-ended questions
is presented by partially correct answers. One possible
solution is to ask several highly proficient listeners to inde- 
pendently answer the questions, and to compile a list of 
their answers. The list is then given to the scorers to reduce 
the number of decisions they have to make. This may still 
leave the scorers with a small number of "far out" answers 
which will need arbitration. 

Recall protocols 

Recall protocols are normally administered in the following
way: (1) a brief listening passage is recorded at normal
speed; (2) a list is prepared of all facts or propositions 
contained in the passage; (3) students listen to the passage; 
(4) they are asked to write down everything they remember 
from the passage. More points may be awarded for recall of 
higher-level propositions than for details (Bernhardt and
James, 1985). Critics of this technique argue that it con- 
founds listening comprehension with memory ability. In 
addition, recall protocols rely on writing, a skill
which may be even less developed than listening. Examinees
may be reluctant to write down what they have understood 
if they are unsure of the grammar and spelling. The solution 
is for students to write the protocols in their native language.
Finally, scoring of recall protocols is labor-intensive and 
requires training to ensure inter-rater reliability.
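
The text does not name a statistic for inter-rater reliability. One common option, offered here as an assumption, is Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. In the sketch below each rater records, for every proposition in the scoring list, whether a protocol earned credit (1) or not (0); the ratings are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters judging the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must judge the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Invented credit decisions for ten propositions in one recall protocol.
rater_1 = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
rater_2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

Values near 1 indicate close agreement; low values suggest that the scoring list or the scorer training needs revision before the protocols are used.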

Non-verbal responses 

Language teachers like to argue about the use of L1 in the
classroom, and this argument spills over into discussions of
testing procedures. Purists insist that L1 should be avoided
at all costs, while pragmatists maintain a "whatever works 
best" position. From a psychometric perspective, the lan- 
guage of response is a source of measurement error because 
we cannot determine how much of the variation in the 
scores is attributable to listening comprehension, and how
much to writing or speaking ability. Examinees may have
understood a passage but be unable to demonstrate their
comprehension through speaking or writing in L2. For this
reason, at lower levels of proficiency, non-verbal responses
are especially useful. A few examples of such responses are 
given below: 

Test-taker hears: A description of a house, a person, or weather.
Test-taker sees: Pictures of four different houses, persons, or weather scenes.
Task: Circle the picture that corresponds to the description.

Test-taker hears: A narrative about a specific event.
Test-taker sees: Pictures representing scenes from the narrative.
Task: Place pictures in chronological order, based on the narrative.

Test-taker hears: A narrative with a clear story line.
Test-taker sees: Pictures of four possible outcomes of the story.
Task: Select outcome most consistent with the story.

Test-taker hears: A lecture on demographics.
Test-taker sees: Graphs or charts representing different population trends.
Task: Select graph or chart representing information in the lecture.

Test-taker hears: Directions how to get somewhere.
Test-taker sees: A city map.
Task: Draw a line to indicate the route described in the directions.

PRESENTATION EFFECTS 

Presentation effects have the potential of confounding listening
comprehension with understanding instructions and
test questions, as well as with differences in test administration.
Some of the most obvious and controllable sources of
error are described below. 

Advance organizers 

Listening in the real world normally occurs in context 
which helps listeners eliminate potentially ambiguous in- 
terpretations of the message, and to infer the meaning of 
unclearly heard or unfamiliar words or phrases. In addi- 
tion, listeners normally have a purpose for listening in 
mind. This helps them decide what to concentrate on, and 
how to listen. In an effort to duplicate these conditions in 
test situations, it is common practice to give test-takers
prelistening questions (Bacon, 1991). Lund (1991b) reported
that listeners who were told to understand as much as they 
could and then write a recall protocol recalled fewer main 
ideas, fewer details, and produced more inappropriate 
interpretations of the text than listeners who were told what
to focus on before they listened to a passage. Lund believes
that unfocused instructions gave listeners little help in 
determining what to concentrate on, so that they tried to 
process everything indiscriminately. Respondents in an 
introspective study by Buck (1991) reported that question 
preview influenced their listening strategies, and made 
listening easier for them. However, Buck suggested that the
effect of prelistening questions may, in fact, depend on the 
passage. Such questions may be helpful when listening to 
expository passages, crammed with facts, but not when 
listening to interesting stories with a clear story line. Note, 
however, that there are no empirical studies comparing the 
effects on listening comprehension of questions before lis- 
tening with questions after listening. 

Language of instructions and language of questions 

The potential for reduced reliability of a listening test is 
even greater when it comes to presenting instructions and 
test questions in L2, especially in the case of lower-proficiency 
examinees, since it is impossible to determine how 
much of the variation in their listening scores can be attributed 
to their L2 listening ability and how much to their L2 
reading comprehension. Whether you decide to present 
instructions and questions in L1 or L2, keep the wording 
short and simple, since your purpose is to test listening 
comprehension, not reading ability. If you decide to present 
questions in L2, keep in mind that it is difficult to simplify 
the language of multiple-choice items. It is also a good idea 
to offer a sample passage for practice to ensure that test-takers 
understand what is expected of them. A sample 
question provides a warmup for students who may otherwise 
miss answering the first test question while trying to 
adjust to the format of the test. 

Uniformity of presentation 

You should make sure that you standardize the way you 
administer your listening test. If you present a listening 
passage live to several classes, it will not be possible for you 
to account for variations in speed, loudness, emphases, 
pauses, acoustics, and background noise. If your test is 
administered by different instructors, there will also be no 
way to account for the potential impact of the difference in 
their voices. Therefore, it is essential that you record the 
passages you want to include in your test. 

In addition, you should keep constant the number of 
times the passages are repeated, as well as time to complete 
responses. You should keep in mind that repeated presentations 
of a listening passage will not be particularly helpful 
to low-level listeners, whereas advanced listeners will be 
more likely to profit from hearing the passage several times 
(Lund, 1991a). In any case, the number of repetitions should 
be kept constant from one test administration to another. 

It is also essential that you give exactly the same instructions 
to all groups of test-takers. For instance, if one group 
is warned that there is a penalty for guessing, and another 
group is not, examinees in the two groups will adopt 
different test-taking strategies and that, in turn, will affect 
their test performance. 

REFINING YOUR TEST 

Chances are that the first time you use a new test, some of 
the items will turn out to be unreliable. A few relatively 
simple steps can go a long way towards increasing the 
reliability of your test without doing complicated and time-consuming 
statistical analyses. First, give the test to a few 
people without having them listen to the passages to find 
out if they can correctly answer any of the questions without 
the benefit of having heard the passages. If they can answer 
some of the questions correctly, it means that they are based 
on extratextual information and can be answered solely on 
the basis of familiarity with the topic, logical reasoning, and 
other types of extralinguistic knowledge. These items should 
be discarded. 
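
The screening step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 0/1 scoring of answers, the four-option multiple-choice format, and the margin above chance are all hypothetical choices, not part of the procedure described here. With only a few respondents the proportions are rough, so treat flagged items as candidates for inspection rather than as automatic rejections.

```python
def passage_independent_items(blind_responses, n_options=4, margin=0.15):
    """Flag items that people answer correctly WITHOUT hearing the passage.

    blind_responses: one row per respondent, each row a list of 0/1 scores
    (1 = correct). n_options and margin are illustrative assumptions: an
    item is flagged when its proportion correct exceeds the guessing rate
    (1 / n_options) by more than `margin`.
    """
    chance = 1 / n_options
    n_people = len(blind_responses)
    flagged = []
    for i in range(len(blind_responses[0])):
        proportion_correct = sum(row[i] for row in blind_responses) / n_people
        if proportion_correct > chance + margin:
            flagged.append((i, proportion_correct))
    return flagged


# Four respondents answered three items without listening to the passage.
blind = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 0]]
print(passage_independent_items(blind))
```

In this made-up example only the first item (index 0) is flagged, because every respondent answered it correctly without hearing the passage.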

 Secondly, pilot the test in one of your classes, and 
analyze the results. Specifically, look for good and bad 
items. In norm-referenced tests, a good item is one of 
average difficulty, i.e., one which is answered correctly by 
about fifty percent of the examinees. In addition, a good 
item is one which correlates well with the total scores, that 
is, it ranks test-takers in approximately the same way as the 
total test scores. Items that were answered incorrectly by 
test-takers who generally did well on the test, and items that 
were answered correctly by those who did poorly on the test 
as a whole should be discarded or re-worded. In addition, 
items that were answered correctly or incorrectly by most 
examinees are non-discriminating, and they too should be 
discarded. If you repeat this procedure several times, you 
will end up with a test that is reliable enough for purposes 
of formative evaluation. However, if more important deci- 
sions ride on the results of the test, you should consider 
adopting a standardized test, or seek the help of a psycho- 
metrician. 
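
For readers who keep pilot results in a spreadsheet or gradebook, the item analysis described above can be sketched in Python as follows. It computes, for each item, the proportion of examinees who answered correctly (item difficulty) and the correlation between the item score and the total score (item discrimination). The function names and the cut-off values (0.1, 0.9, and 0.2) are illustrative assumptions of mine and should be adjusted to your own situation; the code assumes each answer has already been scored 1 (correct) or 0 (incorrect).

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns None when either list has no variance."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None
    covariance = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return covariance / (sx * sy)


def item_analysis(responses, hard=0.1, easy=0.9, min_corr=0.2):
    """responses: one row per examinee, each row a list of 0/1 item scores.

    The thresholds are hypothetical defaults, not established standards.
    """
    totals = [sum(row) for row in responses]
    report = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        difficulty = mean(item)                # proportion answering correctly
        discrimination = pearson(item, totals)  # item-total correlation
        keep = (
            hard < difficulty < easy
            and discrimination is not None
            and discrimination >= min_corr
        )
        report.append(
            {"item": i, "difficulty": difficulty,
             "discrimination": discrimination, "keep": keep}
        )
    return report


# Six examinees, four items (made-up pilot data).
pilot = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
for row in item_analysis(pilot):
    print(row)
```

In this made-up data set, the first item is answered correctly by everyone and therefore cannot discriminate, while the last item is answered correctly only by the lower scorers and correlates negatively with the total; both are marked for discarding or re-wording. Note that the item itself contributes to the total score here, which follows the description above; excluding it from the total is a common variation.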

CONCLUSION 

In this paper, I have made some practical suggestions on 
how to make classroom tests of listening comprehension 
more valid and reliable through careful selection of listen- 
ing passages, and creation of listening tasks that reflect 
cognitive operations involved in real life listening. These 
suggestions must be construed as tentative pending the 
development of a more fully elaborated model of listening 
comprehension. 